Random Forest Algorithm

An analysis of the random forest algorithm and its applications in the health industry

Maddie Sortino and Jisa Jose (Advisor: Dr. Cohen)

2025-04-22

Introduction

Overview of Random Forest Algorithm

  • Random Forest (RF) is a widely used ensemble machine learning algorithm built on decision trees.
  • It combines two key techniques:
    • Bootstrapping (bagging): Creates multiple subsets of data for training.
    • Random feature selection: Randomly selects features at each split to reduce correlation between trees.
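These two sources of randomness can be sketched in a few lines of base R (a toy illustration; the dataset size, feature names, and mtry value are made up, and tree fitting itself is omitted):

```r
# Toy sketch of the two randomization steps behind a random forest
# (names and sizes are made up; tree fitting itself is omitted)
set.seed(42)
n_rows   <- 100
features <- c("Age", "Sex", "RestingBP", "Cholesterol", "MaxHR")
mtry     <- 3   # number of features considered at each split

# Bootstrapping (bagging): each tree trains on rows sampled with replacement
boot_idx <- sample(n_rows, n_rows, replace = TRUE)

# Random feature selection: each split considers only mtry random features
split_features <- sample(features, mtry)

length(unique(boot_idx))  # on average ~63% of rows appear in a bootstrap sample
split_features
```

Because each tree sees different rows and different candidate features, the trees are decorrelated, and averaging their votes reduces variance.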

Pros & Cons

Overview of the strengths and limitations of using Random Forest for clinical applications

Advantages in Healthcare:

  • Handles complex, high-dimensional clinical data
  • Robust against overfitting, noise, and missing data
  • Flexible for both classification and regression
  • Provides insights through feature importance

Limitations:

  • High computational requirements
  • Longer training time with large datasets
  • Lower interpretability (compared to simpler methods)
  • Performance may plateau with excessive tuning

Real World Applications

Study Objective

  • Predict heart disease using Random Forest and structured clinical data

  • Compare baseline model with tuned Random Forest model

  • Evaluate model performance using key metrics:

    • Accuracy
    • Precision
    • Recall
    • F1 Score
    • AUC-ROC (Area under the Receiver Operating Characteristic curve)

Method Overview

  • Step 1: Single Decision Tree (Baseline)
    • Built a simple decision tree to establish a baseline and gain initial insights into the data
  • Step 2: Random Forest Model
    • Improved predictions using a Random Forest with 100 decision trees
  • Step 3: Hyperparameter Tuning
    • Optimized model performance using 5-fold cross-validation
    • Tuned the mtry parameter (number of features considered at each split)

Hyperparameter Tuning Explanation

  • Hyperparameters are adjustable settings that influence how the model learns from data

  • In Random Forest, hyperparameters control:

    • How trees are built
    • How features are selected
    • How predictions are aggregated
  • Proper tuning helps:

    • Improve predictive accuracy
    • Reduce overfitting
    • Ensure the model generalizes well, especially in sensitive fields like healthcare

Evaluation Metrics

  • Accuracy: Measures overall correctness of the model.
    \[Accuracy = \frac{TP + TN}{TP + TN + FP + FN}\]

  • Precision: Measures how many predicted positives are truly positive.
    \[Precision = \frac{TP}{TP + FP}\]

  • Recall (Sensitivity): Measures how many actual positives were correctly identified.
    \[Recall = \frac{TP}{TP + FN}\]

  • F1 Score: Balances precision and recall. A good measure for imbalanced datasets.
    \[F1 = \frac{2 \cdot (Precision \cdot Recall)}{Precision + Recall}\]

  • AUC-ROC: Area under the ROC curve; shows how well the model distinguishes between classes.

    • Higher AUC means better class separation performance.

(TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative)
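As a sanity check, these metrics can be computed directly from the four confusion-matrix counts; the counts below are the ones reported later in this document for the default random forest:

```r
# Evaluation metrics from confusion-matrix counts
# (TP/TN/FP/FN taken from the default random forest results)
TP <- 94; TN <- 104; FP <- 13; FN <- 13

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)   # sensitivity
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 3)
# accuracy = 0.884, precision = 0.879, recall = 0.879, f1 = 0.879
```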

Dataset Overview

  • Source: Kaggle – Heart Failure Prediction Dataset
  • Total Records: 918 patient entries, with 11 clinical and demographic features

Key Features:

  • Demographic / General Clinical Features:
    Age, Sex, RestingBP (Resting Blood Pressure), Cholesterol, MaxHR (Maximum Heart Rate Achieved), FastingBS (Fasting Blood Sugar)

  • Cardiac-Specific Clinical Features:
    ChestPainType, RestingECG (Resting Electrocardiogram results), ExerciseAngina (Exercise-induced Angina), Oldpeak (ST Depression), ST_Slope (Slope of ST segment)

Target Variable:

HeartDisease (1 = heart disease, 0 = no heart disease)

Table 1: Data Structure Overview

Column          Type       Example values
Age             integer    40, 49, 37, 48, 54
Sex             character  M, F, M, F, M
ChestPainType   character  ATA, NAP, ATA, ASY, NAP
RestingBP       integer    140, 160, 130, 138, 150
Cholesterol     integer    289, 180, 283, 214, 195
FastingBS       integer    0, 0, 0, 0, 0
RestingECG      character  Normal, Normal, ST, Normal, Normal
MaxHR           integer    172, 156, 98, 108, 122
ExerciseAngina  character  N, N, N, Y, N
Oldpeak         numeric    0, 1, 0, 1.5, 0
ST_Slope        character  Up, Flat, Up, Flat, Up
HeartDisease    integer    0, 1, 0, 1, 0

Summary Statistics

Characteristic N = 918¹
Age 54 (47, 60)
Sex
    F 193 (21%)
    M 725 (79%)
ChestPainType
    ASY 496 (54%)
    ATA 173 (19%)
    NAP 203 (22%)
    TA 46 (5.0%)
RestingBP 130 (120, 140)
Cholesterol 223 (173, 267)
FastingBS 214 (23%)
RestingECG
    LVH 188 (20%)
    Normal 552 (60%)
    ST 178 (19%)
MaxHR 138 (120, 156)
ExerciseAngina
    N 547 (60%)
    Y 371 (40%)
Oldpeak 0.60 (0.00, 1.50)
ST_Slope
    Down 63 (6.9%)
    Flat 460 (50%)
    Up 395 (43%)
HeartDisease 508 (55%)
¹ Median (Q1, Q3); n (%)

(Table 2: Summary Statistics of Heart Disease dataset)

Key Insights & Observations

  • Data Quality:
    • No missing or duplicate records
    • Some unrealistic values (e.g., cholesterol = 0) were cleaned prior to modeling
  • Gender Distribution:
    • 79% male, 21% female
    • Potential influence on model fairness and generalization
  • Chest Pain Type:
    • 54% of patients reported asymptomatic chest pain (ASY)
    • Indicates a high number of silent or undiagnosed cases
  • Heart Disease Distribution:
    • 55.3% diagnosed with heart disease
    • 44.7% undiagnosed
    • Fairly balanced for effective classification

Distribution of Features

Figure 1: Distribution of some Features

Code
library(ggplot2)
library(gridExtra)

# The distribution of 'Age' with a histogram - normal distribution
ageplot <- ggplot(data, aes(x = Age)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(title = "Age Distribution", x = "Age", y = "Count")

# The distribution of 'Heart Disease' with a histogram - no class imbalance
hdplot <- ggplot(data, aes(x = HeartDisease)) +
  geom_bar(fill = "blue", color = "black") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(breaks=c(0,1)) +
  labs(title = "Heart Disease Class Distribution", x = "Heart Disease", y = "Count")

# The distribution of 'Sex' with a histogram - imbalance: ~4x more males than females
splot <- ggplot(data, aes(x = Sex)) +
  geom_bar(fill = "blue", color = "black") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(title = "Sex Distribution", x = "Sex", y = "Count")

# The distribution of 'Cholesterol' with a histogram - over 150 records with a cholesterol of 0; otherwise normal distribution
cplot <- ggplot(data, aes(x = Cholesterol)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(title = "Cholesterol Distribution", x = "Cholesterol", y = "Count")

grid.arrange(hdplot, splot, ageplot, cplot, ncol = 2, nrow=2)
  • These plots highlight distributions and trends relevant to predicting heart disease.

Distribution Explanations

  • Heart Disease Distribution (Top-Left): The distribution is appropriately balanced, minimizing the chances of bias in the model.
  • Sex Distribution (Top-Right): The dataset has more male patients than female, which might impact predictions.
  • Age Distribution (Bottom-Left): Most patients fall within the 40-70 years age range, with the data being normally distributed.
  • Cholesterol Distribution (Bottom-Right): The presence of zero values in cholesterol is unrealistic, indicating the need for data cleaning.

Correlation Matrix

Figure 2: Correlation Matrix – Understanding Key Relationships

Code
# correlation matrix for numeric features
library(corrplot)
library(dplyr)
numeric_data <- data %>% select(where(is.numeric))
cor_matrix <- cor(numeric_data)
corrplot(cor_matrix, method = "circle", type = "upper", 
         tl.col = "black", tl.cex = 0.7, addCoef.col = "black")

Correlation Matrix Insights

  • Figure 2 shows correlations of features in the data set with heart disease.
  • Oldpeak (0.40), or ST depression, stands out: higher ST depression, measured on an ECG, indicates reduced blood flow to the heart and is associated with greater heart disease risk.
  • Patients with heart disease tend to have a lower maximum heart rate (MaxHR, -0.40).
  • Age (0.28) and Fasting Blood Sugar (0.27) also emerged as positive correlates, confirming that older patients and those with high fasting blood sugar are at elevated risk.

Modeling and Results

  • Begin with performing any necessary cleaning and preprocessing of the data.

  • We will then use the decision tree algorithm to demonstrate how a decision tree works, and show the performance of one tree.

  • The next step will be using the random forest algorithm, which is a combination of decision trees, to see how it performs in comparison, ideally providing a more accurate prediction.

Data Preprocessing and Cleaning

  • No null or NA missing values found in the data set
  • One row with a RestingBP = 0
  • 172 rows with Cholesterol = 0
  • We decided to drop these rows from the data set, as they were missing valid data.
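The cleaning step can be sketched on a tiny illustrative data frame (the made-up rows below mimic the schema of the real 918-row Kaggle file):

```r
# Sketch of dropping physiologically impossible zero values
# (tiny made-up rows; the real dataset has 918 rows)
data <- data.frame(
  RestingBP    = c(140, 0, 130, 150),
  Cholesterol  = c(289, 0, 0, 195),
  HeartDisease = c(0, 1, 0, 1)
)
clean <- data[data$RestingBP > 0 & data$Cholesterol > 0, ]
nrow(clean)  # rows with RestingBP = 0 or Cholesterol = 0 are removed
```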

Data Encoding

  • The next step is to encode the data.
  • The random forest algorithm, like most machine learning algorithms, functions best with numerical values.
  • Use one-hot encoding to transform the categorical variables into binary columns that indicate the presence (1) or absence (0) of each category.
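One way to do this in base R is model.matrix(); the contrasts.arg trick keeps every factor level as its own column (so both SexF and SexM appear). A sketch on three made-up rows:

```r
# One-hot encoding sketch with base R's model.matrix(),
# keeping all factor levels so e.g. both SexF and SexM appear
df <- data.frame(
  Age           = c(40, 49, 37),
  Sex           = factor(c("M", "F", "M")),
  ChestPainType = factor(c("ATA", "NAP", "ATA"))
)
full_dummies <- lapply(df[sapply(df, is.factor)], contrasts, contrasts = FALSE)
encoded <- model.matrix(~ ., data = df, contrasts.arg = full_dummies)[, -1]
colnames(encoded)  # "Age" "SexF" "SexM" "ChestPainTypeATA" "ChestPainTypeNAP"
```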

Encoded Data Preview

Age SexF SexM ChestPainTypeASY ChestPainTypeATA ChestPainTypeNAP ChestPainTypeTA RestingBP Cholesterol FastingBS RestingECGLVH RestingECGNormal RestingECGST MaxHR ExerciseAnginaN ExerciseAnginaY Oldpeak ST_SlopeDown ST_SlopeFlat ST_SlopeUp HeartDisease
40 0 1 0 1 0 0 140 289 0 0 1 0 172 1 0 0 0 0 1 0
49 1 0 0 0 1 0 160 180 0 0 1 0 156 1 0 1 0 1 0 1
37 0 1 0 1 0 0 130 283 0 0 0 1 98 1 0 0 0 0 1 0

Splitting Data

  • We split the data set into training and test subsets.

  • The training subset will contain 70% of the data.

  • The test subset will contain 30% of the data.
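A base-R sketch of the split (the seed is an arbitrary choice):

```r
# 70/30 train-test split sketch (base R)
set.seed(123)
n <- 918   # row count before cleaning, for illustration
train_idx <- sample(n, size = floor(0.7 * n))
length(train_idx)      # 642 training rows
n - length(train_idx)  # 276 test rows
# train <- data[train_idx, ]; test <- data[-train_idx, ]
```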

Model Fitting and Prediction

Decision Tree

  • We first demonstrate how a single decision tree would look for our data set.
  • We achieved an accuracy of 81.7% without hyperparameter tuning.
  • The decision tree can be followed to determine what the predicted end result would be.
  • For example, if the patient has ST_SlopeUp = 1 and ChestPainTypeASY = 0, they likely do not have heart disease. If the patient has ST_SlopeUp = 0, MaxHR < 151, and SexF = 0, then the patient likely does have heart disease.
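A minimal sketch of the baseline tree, assuming the rpart package; the data frame here is a synthetic stand-in for the cleaned training split (real column names, made-up values):

```r
# Baseline single decision tree (sketch; assumes the rpart package)
library(rpart)
set.seed(1)
# synthetic stand-in for the cleaned training data
train <- data.frame(
  MaxHR   = sample(80:200, 200, replace = TRUE),
  Oldpeak = runif(200, 0, 4)
)
train$HeartDisease <- factor(ifelse(train$Oldpeak + rnorm(200, sd = 0.5) > 2,
                                    1, 0))
tree <- rpart(HeartDisease ~ ., data = train, method = "class")
pred <- predict(tree, train, type = "class")
mean(pred == train$HeartDisease)  # training accuracy of the single tree
# rpart.plot::rpart.plot(tree) draws the fitted tree
```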

Decision Tree

Random Forest

  • After gaining an understanding of how a single decision tree functions, we proceed with the bulk of our analysis using the random forest algorithm.
  • We trained the random forest using 100 trees.
  • The random forest achieved an accuracy of 88.4%, which is higher than the 81.7% obtained from the single decision tree, as expected.
  • The confusion matrix shows that there were 104 true negatives (HeartDisease=0), 94 true positives (HeartDisease=1), 13 false negatives, and 13 false positives.
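A sketch of this step, assuming the randomForest package; the data frames below are synthetic stand-ins for the real train/test split (column names match the dataset, values are made up):

```r
# Random forest with 100 trees (sketch; assumes the randomForest package)
library(randomForest)
set.seed(1)
# synthetic stand-in for the encoded heart data
d <- data.frame(
  Age     = sample(30:77, 300, replace = TRUE),
  MaxHR   = sample(80:200, 300, replace = TRUE),
  Oldpeak = runif(300, 0, 4)
)
d$HeartDisease <- factor(ifelse(d$Oldpeak + rnorm(300) > 2, 1, 0))
train <- d[1:210, ]; test <- d[211:300, ]

rf   <- randomForest(HeartDisease ~ ., data = train, ntree = 100)
pred <- predict(rf, newdata = test)
table(Predicted = pred, Actual = test$HeartDisease)  # confusion matrix
mean(pred == test$HeartDisease)                      # test accuracy
```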

Confusion Matrix - Default

Hyperparameter Tuning

  • After achieving an accuracy of 88.4% on the initial random forest model built using default parameters, we used hyperparameter tuning to improve the model further.
  • Tuning the model is a crucial step in machine learning because the default values may not be the most accurate or generalizable (Probst, Wright, and Boulesteix 2019).
  • We tuned the model using 5-fold cross-validation to select the key parameter mtry, the number of variables randomly chosen at every split of a tree.
  • Area Under the ROC Curve (AUC) was used as the main optimization metric.
  • The AUC was chosen because it assesses a model’s performance across all classification thresholds rather than at a single cutoff, which matters greatly in binary classification problems with class imbalance.
  • mtry=3 achieved the greatest mean AUC (0.9302), which suggested that this setting provided the optimal compromise between overfitting and underfitting (Oshiro, Perez, and Baranauskas 2012).
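This tuning procedure can be sketched with caret, where AUC is named "ROC" in twoClassSummary. The sketch assumes the caret package (which in turn needs randomForest and pROC) and uses a synthetic stand-in for the training set:

```r
# mtry tuning with 5-fold CV, optimizing AUC ("ROC" in caret's
# twoClassSummary). Sketch; data is a synthetic stand-in for the
# real training split, and outcome levels must be valid R names.
library(caret)
set.seed(1)
train <- data.frame(
  Age         = sample(30:77, 300, replace = TRUE),
  MaxHR       = sample(80:200, 300, replace = TRUE),
  Cholesterol = sample(150:300, 300, replace = TRUE),
  Oldpeak     = runif(300, 0, 4)
)
train$HeartDisease <- factor(ifelse(train$Oldpeak + rnorm(300) > 2,
                                    "Yes", "No"))
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
rf_tuned <- train(HeartDisease ~ ., data = train,
                  method = "rf", metric = "ROC",
                  tuneGrid = expand.grid(mtry = 1:4),
                  trControl = ctrl)
rf_tuned$bestTune  # the mtry value with the highest mean cross-validated AUC
```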

Confusion Matrix - Tuned Model

Area Under Curve

(Figure 6: ROC curves with AUC scores for both the basic and tuned RF models)

Analysis Summary

  • The ROC (Receiver Operating Characteristic) curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) across different thresholds.
  • The ROC curve indicates strong separation between classes.
  • The basic random forest model had an AUC score of 0.8837 and the tuned random forest model had an AUC score of 0.9371.
  • This shows how hyperparameter tuning can improve the accuracy and reliability of heart disease diagnosis.
  • An AUC above 0.90 is typically considered excellent (Fawcett 2006).
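Such a comparison can be sketched with the pROC package; the labels and probabilities below are synthetic placeholders for the models' class probabilities (e.g. predict(model, test, type = "prob")[, 2]):

```r
# ROC/AUC comparison sketch (assumes the pROC package; labels and
# probabilities are synthetic placeholders for real model outputs)
library(pROC)
set.seed(1)
actual     <- factor(sample(0:1, 200, replace = TRUE))
prob_basic <- runif(200)                        # near-random scores
prob_tuned <- ifelse(actual == 1, runif(200, 0.3, 1),
                     runif(200, 0, 0.7))        # better separation

roc_basic <- roc(actual, prob_basic, quiet = TRUE)
roc_tuned <- roc(actual, prob_tuned, quiet = TRUE)
auc(roc_basic); auc(roc_tuned)  # higher AUC = better class separation

plot(roc_basic, col = "grey")
plot(roc_tuned, col = "blue", add = TRUE)
```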

Feature Importance

  • We also examined feature importance, which showed ST_SlopeUp, ChestPainTypeASY, and ST_SlopeFlat to be among the most important predictors.
  • This is consistent with medical domain knowledge since changes in ST segments and chest pain types are known markers of cardiac abnormality (Khalilia, Chakraborty, and Popescu 2011).
  • The model’s ability to capture meaningful physiological patterns is also supported by the high ranking of MaxHR and Oldpeak.
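A sketch of extracting this ranking, assuming the randomForest package; the data is a synthetic stand-in in which Oldpeak genuinely drives the outcome:

```r
# Feature importance sketch (assumes the randomForest package;
# synthetic data where Oldpeak drives the outcome)
library(randomForest)
set.seed(1)
d <- data.frame(
  MaxHR   = sample(80:200, 300, replace = TRUE),
  Age     = sample(30:77, 300, replace = TRUE),
  Oldpeak = runif(300, 0, 4)
)
d$HeartDisease <- factor(ifelse(d$Oldpeak + rnorm(300) > 2, 1, 0))
rf <- randomForest(HeartDisease ~ ., data = d, ntree = 100)

importance(rf)  # MeanDecreaseGini per feature; Oldpeak should rank highest
varImpPlot(rf)  # the kind of ranking shown in Figure 7
```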

Feature Importance

(Figure 7. Feature importance based on the average decrease in Gini index.)

Conclusion

  • The aim of the study was to evaluate the Random Forest algorithm.
  • Using a data set to predict heart disease, we compared the results using default settings against a model that was optimized after hyperparameter tuning.
  • Both models provided satisfactory classification performance (>80% accuracy).
  • The optimized model showed improvement over the initial model when comparing AUC values.
  • The tuned model had an accuracy of 88.4% and sensitivity of 94.9% with the corresponding F1 score of 0.892.
  • From a practical perspective, this study has important implications for the use of Random Forests in clinical decision support.
  • The algorithm’s robust nature and ability to handle various data types make it an excellent choice for healthcare applications.
  • The performance gain after tuning implies that default hyperparameters can be used as a baseline for beginning evaluation or determining which model type may be most appropriate, but tuning should be taken into consideration.
  • Random Forest can provide healthcare professionals with support for risk assessment and diagnostics.

References

Fawcett, Tom. 2006. “An Introduction to ROC Analysis.” Pattern Recognition Letters 27 (8): 861–74. https://doi.org/10.1016/j.patrec.2005.10.010.
Khalilia, Mohammed, Sounak Chakraborty, and Mihail Popescu. 2011. “Predicting Disease Risks from Highly Imbalanced Data Using Random Forest.” BMC Medical Informatics and Decision Making 11: 1–13. https://doi.org/10.1186/1472-6947-11-51.
Khine, Wai Wai, and Zaw Tun. 2022. “Diabetes Prediction Based on Machine Learning Algorithms: MNB, Random Forest, SVM.” IEEE Access. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10754937.
Mbonyinshuti, François, Jean Nshimiyimana, and Claude Uwitonze. 2022. “Application of Random Forest Model to Predict the Demand of Essential Medicines for Non-Communicable Diseases Management in Public Health Facilities.” Pan African Medical Journal 42: 89. https://pmc.ncbi.nlm.nih.gov/articles/PMC9379432/.
Oshiro, Thais Mayumi, Pedro Santoro Perez, and José Augusto Baranauskas. 2012. “How Many Trees in a Random Forest?” In Machine Learning and Data Mining in Pattern Recognition: 8th International Conference, MLDM 2012, Berlin, Germany, July 13-20, 2012. Proceedings 8, 154–68. Springer. https://doi.org/10.1007/978-3-642-31537-4_13.
Probst, Philipp, Marvin N. Wright, and Anne-Laure Boulesteix. 2019. “Random Forest Algorithms in Health Care Sectors: A Review of Applications.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9 (3): e1301. https://www.researchgate.net/publication/358128515_Random_Forest_Algorithms_in_Health_Care_Sectors_A_Review_of_Applications.
Rigatti, Steven J. 2017. “Random Forest.” Journal of Insurance Medicine 47 (1): 31–39. https://doi.org/10.17849/insm-47-01-31-39.1.